Loans exploration by Vladyslav Babyč

This report explores a dataset containing details for approximately 114,000 loans. The dataset was provided by Prosper.

## [1] 113937     81
## 'data.frame':    113937 obs. of  81 variables:
##  $ ListingKey                         : chr  "1021339766868145413AB3B" "10273602499503308B223C1" "0EE9337825851032864889A" "0EF5356002482715299901A" ...
##  $ ListingNumber                      : int  193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
##  $ ListingCreationDate                : chr  "2007-08-26 19:09:29.263000000" "2014-02-27 08:28:07.900000000" "2007-01-05 15:00:47.090000000" "2012-10-22 11:02:35.010000000" ...
##  $ CreditGrade                        : chr  "C" "" "HR" "" ...
##  $ Term                               : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ LoanStatus                         : chr  "Completed" "Current" "Completed" "Current" ...
##  $ ClosedDate                         : chr  "2009-08-14 00:00:00" "" "2009-12-17 00:00:00" "" ...
##  $ BorrowerAPR                        : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ BorrowerRate                       : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ LenderYield                        : num  0.138 0.082 0.24 0.0874 0.1985 ...
##  $ EstimatedEffectiveYield            : num  NA 0.0796 NA 0.0849 0.1832 ...
##  $ EstimatedLoss                      : num  NA 0.0249 NA 0.0249 0.0925 ...
##  $ EstimatedReturn                    : num  NA 0.0547 NA 0.06 0.0907 ...
##  $ ProsperRating..numeric.            : int  NA 6 NA 6 3 5 2 4 7 7 ...
##  $ ProsperRating..Alpha.              : chr  "" "A" "" "A" ...
##  $ ProsperScore                       : num  NA 7 NA 9 4 10 2 4 9 11 ...
##  $ ListingCategory..numeric.          : int  0 2 0 16 2 1 1 2 7 7 ...
##  $ BorrowerState                      : chr  "CO" "CO" "GA" "GA" ...
##  $ Occupation                         : chr  "Other" "Professional" "Other" "Skilled Labor" ...
##  $ EmploymentStatus                   : chr  "Self-employed" "Employed" "Not available" "Employed" ...
##  $ EmploymentStatusDuration           : int  2 44 NA 113 44 82 172 103 269 269 ...
##  $ IsBorrowerHomeowner                : chr  "True" "False" "False" "True" ...
##  $ CurrentlyInGroup                   : chr  "True" "False" "True" "False" ...
##  $ GroupKey                           : chr  "" "" "783C3371218786870A73D20" "" ...
##  $ DateCreditPulled                   : chr  "2007-08-26 18:41:46.780000000" "2014-02-27 08:28:14" "2007-01-02 14:09:10.060000000" "2012-10-22 11:02:32" ...
##  $ CreditScoreRangeLower              : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper              : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ FirstRecordedCreditLine            : chr  "2001-10-11 00:00:00" "1996-03-18 00:00:00" "2002-07-27 00:00:00" "1983-02-28 00:00:00" ...
##  $ CurrentCreditLines                 : int  5 14 NA 5 19 21 10 6 17 17 ...
##  $ OpenCreditLines                    : int  4 14 NA 5 19 17 7 6 16 16 ...
##  $ TotalCreditLinespast7years         : int  12 29 3 29 49 49 20 10 32 32 ...
##  $ OpenRevolvingAccounts              : int  1 13 0 7 6 13 6 5 12 12 ...
##  $ OpenRevolvingMonthlyPayment        : num  24 389 0 115 220 1410 214 101 219 219 ...
##  $ InquiriesLast6Months               : int  3 3 0 0 1 0 0 3 1 1 ...
##  $ TotalInquiries                     : num  3 5 1 1 9 2 0 16 6 6 ...
##  $ CurrentDelinquencies               : int  2 0 1 4 0 0 0 0 0 0 ...
##  $ AmountDelinquent                   : num  472 0 NA 10056 0 ...
##  $ DelinquenciesLast7Years            : int  4 0 0 14 0 0 0 0 0 0 ...
##  $ PublicRecordsLast10Years           : int  0 1 0 0 0 0 0 1 0 0 ...
##  $ PublicRecordsLast12Months          : int  0 0 NA 0 0 0 0 0 0 0 ...
##  $ RevolvingCreditBalance             : num  0 3989 NA 1444 6193 ...
##  $ BankcardUtilization                : num  0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
##  $ AvailableBankcardCredit            : num  1500 10266 NA 30754 695 ...
##  $ TotalTrades                        : num  11 29 NA 26 39 47 16 10 29 29 ...
##  $ TradesNeverDelinquent..percentage. : num  0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
##  $ TradesOpenedLast6Months            : num  0 2 NA 0 2 0 0 0 1 1 ...
##  $ DebtToIncomeRatio                  : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ IncomeRange                        : chr  "$25,000-49,999" "$50,000-74,999" "Not displayed" "$25,000-49,999" ...
##  $ IncomeVerifiable                   : chr  "True" "True" "True" "True" ...
##  $ StatedMonthlyIncome                : num  3083 6125 2083 2875 9583 ...
##  $ LoanKey                            : chr  "E33A3400205839220442E84" "9E3B37071505919926B1D82" "6954337960046817851BCB2" "A0393664465886295619C51" ...
##  $ TotalProsperLoans                  : int  NA NA NA NA 1 NA NA NA NA NA ...
##  $ TotalProsperPaymentsBilled         : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ OnTimeProsperPayments              : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ ProsperPaymentsLessThanOneMonthLate: int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPaymentsOneMonthPlusLate    : int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPrincipalBorrowed           : num  NA NA NA NA 11000 NA NA NA NA NA ...
##  $ ProsperPrincipalOutstanding        : num  NA NA NA NA 9948 ...
##  $ ScorexChangeAtTimeOfListing        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanCurrentDaysDelinquent          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ LoanFirstDefaultedCycleNumber      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanMonthsSinceOrigination         : int  78 0 86 16 6 3 11 10 3 3 ...
##  $ LoanNumber                         : int  19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
##  $ LoanOriginalAmount                 : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ LoanOriginationDate                : chr  "2007-09-12 00:00:00" "2014-03-03 00:00:00" "2007-01-17 00:00:00" "2012-11-01 00:00:00" ...
##  $ LoanOriginationQuarter             : chr  "Q3 2007" "Q1 2014" "Q1 2007" "Q4 2012" ...
##  $ MemberKey                          : chr  "1F3E3376408759268057EDA" "1D13370546739025387B2F4" "5F7033715035555618FA612" "9ADE356069835475068C6D2" ...
##  $ MonthlyLoanPayment                 : num  330 319 123 321 564 ...
##  $ LP_CustomerPayments                : num  11396 0 4187 5143 2820 ...
##  $ LP_CustomerPrincipalPayments       : num  9425 0 3001 4091 1563 ...
##  $ LP_InterestandFees                 : num  1971 0 1186 1052 1257 ...
##  $ LP_ServiceFees                     : num  -133.2 0 -24.2 -108 -60.3 ...
##  $ LP_CollectionFees                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_GrossPrincipalLoss              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NetPrincipalLoss                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NonPrincipalRecoverypayments    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PercentFunded                      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Recommendations                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsCount         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsAmount        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Investors                          : int  258 1 41 158 20 1 1 1 1 1 ...
##   ListingKey        ListingNumber     ListingCreationDate CreditGrade       
##  Length:113937      Min.   :      4   Length:113937       Length:113937     
##  Class :character   1st Qu.: 400919   Class :character    Class :character  
##  Mode  :character   Median : 600554   Mode  :character    Mode  :character  
##                     Mean   : 627886                                         
##                     3rd Qu.: 892634                                         
##                     Max.   :1255725                                         
##                                                                             
##       Term        LoanStatus         ClosedDate         BorrowerAPR     
##  Min.   :12.00   Length:113937      Length:113937      Min.   :0.00653  
##  1st Qu.:36.00   Class :character   Class :character   1st Qu.:0.15629  
##  Median :36.00   Mode  :character   Mode  :character   Median :0.20976  
##  Mean   :40.83                                         Mean   :0.21883  
##  3rd Qu.:36.00                                         3rd Qu.:0.28381  
##  Max.   :60.00                                         Max.   :0.51229  
##                                                        NA's   :25       
##   BorrowerRate     LenderYield      EstimatedEffectiveYield EstimatedLoss  
##  Min.   :0.0000   Min.   :-0.0100   Min.   :-0.183          Min.   :0.005  
##  1st Qu.:0.1340   1st Qu.: 0.1242   1st Qu.: 0.116          1st Qu.:0.042  
##  Median :0.1840   Median : 0.1730   Median : 0.162          Median :0.072  
##  Mean   :0.1928   Mean   : 0.1827   Mean   : 0.169          Mean   :0.080  
##  3rd Qu.:0.2500   3rd Qu.: 0.2400   3rd Qu.: 0.224          3rd Qu.:0.112  
##  Max.   :0.4975   Max.   : 0.4925   Max.   : 0.320          Max.   :0.366  
##                                     NA's   :29084           NA's   :29084  
##  EstimatedReturn  ProsperRating..numeric. ProsperRating..Alpha.  ProsperScore  
##  Min.   :-0.183   Min.   :1.000           Length:113937         Min.   : 1.00  
##  1st Qu.: 0.074   1st Qu.:3.000           Class :character      1st Qu.: 4.00  
##  Median : 0.092   Median :4.000           Mode  :character      Median : 6.00  
##  Mean   : 0.096   Mean   :4.072                                 Mean   : 5.95  
##  3rd Qu.: 0.117   3rd Qu.:5.000                                 3rd Qu.: 8.00  
##  Max.   : 0.284   Max.   :7.000                                 Max.   :11.00  
##  NA's   :29084    NA's   :29084                                 NA's   :29084  
##  ListingCategory..numeric. BorrowerState       Occupation       
##  Min.   : 0.000            Length:113937      Length:113937     
##  1st Qu.: 1.000            Class :character   Class :character  
##  Median : 1.000            Mode  :character   Mode  :character  
##  Mean   : 2.774                                                 
##  3rd Qu.: 3.000                                                 
##  Max.   :20.000                                                 
##                                                                 
##  EmploymentStatus   EmploymentStatusDuration IsBorrowerHomeowner
##  Length:113937      Min.   :  0.00           Length:113937      
##  Class :character   1st Qu.: 26.00           Class :character   
##  Mode  :character   Median : 67.00           Mode  :character   
##                     Mean   : 96.07                              
##                     3rd Qu.:137.00                              
##                     Max.   :755.00                              
##                     NA's   :7625                                
##  CurrentlyInGroup     GroupKey         DateCreditPulled   CreditScoreRangeLower
##  Length:113937      Length:113937      Length:113937      Min.   :  0.0        
##  Class :character   Class :character   Class :character   1st Qu.:660.0        
##  Mode  :character   Mode  :character   Mode  :character   Median :680.0        
##                                                           Mean   :685.6        
##                                                           3rd Qu.:720.0        
##                                                           Max.   :880.0        
##                                                           NA's   :591          
##  CreditScoreRangeUpper FirstRecordedCreditLine CurrentCreditLines
##  Min.   : 19.0         Length:113937           Min.   : 0.00     
##  1st Qu.:679.0         Class :character        1st Qu.: 7.00     
##  Median :699.0         Mode  :character        Median :10.00     
##  Mean   :704.6                                 Mean   :10.32     
##  3rd Qu.:739.0                                 3rd Qu.:13.00     
##  Max.   :899.0                                 Max.   :59.00     
##  NA's   :591                                   NA's   :7604      
##  OpenCreditLines TotalCreditLinespast7years OpenRevolvingAccounts
##  Min.   : 0.00   Min.   :  2.00             Min.   : 0.00        
##  1st Qu.: 6.00   1st Qu.: 17.00             1st Qu.: 4.00        
##  Median : 9.00   Median : 25.00             Median : 6.00        
##  Mean   : 9.26   Mean   : 26.75             Mean   : 6.97        
##  3rd Qu.:12.00   3rd Qu.: 35.00             3rd Qu.: 9.00        
##  Max.   :54.00   Max.   :136.00             Max.   :51.00        
##  NA's   :7604    NA's   :697                                     
##  OpenRevolvingMonthlyPayment InquiriesLast6Months TotalInquiries   
##  Min.   :    0.0             Min.   :  0.000      Min.   :  0.000  
##  1st Qu.:  114.0             1st Qu.:  0.000      1st Qu.:  2.000  
##  Median :  271.0             Median :  1.000      Median :  4.000  
##  Mean   :  398.3             Mean   :  1.435      Mean   :  5.584  
##  3rd Qu.:  525.0             3rd Qu.:  2.000      3rd Qu.:  7.000  
##  Max.   :14985.0             Max.   :105.000      Max.   :379.000  
##                              NA's   :697          NA's   :1159     
##  CurrentDelinquencies AmountDelinquent   DelinquenciesLast7Years
##  Min.   : 0.0000      Min.   :     0.0   Min.   : 0.000         
##  1st Qu.: 0.0000      1st Qu.:     0.0   1st Qu.: 0.000         
##  Median : 0.0000      Median :     0.0   Median : 0.000         
##  Mean   : 0.5921      Mean   :   984.5   Mean   : 4.155         
##  3rd Qu.: 0.0000      3rd Qu.:     0.0   3rd Qu.: 3.000         
##  Max.   :83.0000      Max.   :463881.0   Max.   :99.000         
##  NA's   :697          NA's   :7622       NA's   :990            
##  PublicRecordsLast10Years PublicRecordsLast12Months RevolvingCreditBalance
##  Min.   : 0.0000          Min.   : 0.000            Min.   :      0       
##  1st Qu.: 0.0000          1st Qu.: 0.000            1st Qu.:   3121       
##  Median : 0.0000          Median : 0.000            Median :   8549       
##  Mean   : 0.3126          Mean   : 0.015            Mean   :  17599       
##  3rd Qu.: 0.0000          3rd Qu.: 0.000            3rd Qu.:  19521       
##  Max.   :38.0000          Max.   :20.000            Max.   :1435667       
##  NA's   :697              NA's   :7604              NA's   :7604          
##  BankcardUtilization AvailableBankcardCredit  TotalTrades    
##  Min.   :0.000       Min.   :     0          Min.   :  0.00  
##  1st Qu.:0.310       1st Qu.:   880          1st Qu.: 15.00  
##  Median :0.600       Median :  4100          Median : 22.00  
##  Mean   :0.561       Mean   : 11210          Mean   : 23.23  
##  3rd Qu.:0.840       3rd Qu.: 13180          3rd Qu.: 30.00  
##  Max.   :5.950       Max.   :646285          Max.   :126.00  
##  NA's   :7604        NA's   :7544            NA's   :7544    
##  TradesNeverDelinquent..percentage. TradesOpenedLast6Months DebtToIncomeRatio
##  Min.   :0.000                      Min.   : 0.000          Min.   : 0.000   
##  1st Qu.:0.820                      1st Qu.: 0.000          1st Qu.: 0.140   
##  Median :0.940                      Median : 0.000          Median : 0.220   
##  Mean   :0.886                      Mean   : 0.802          Mean   : 0.276   
##  3rd Qu.:1.000                      3rd Qu.: 1.000          3rd Qu.: 0.320   
##  Max.   :1.000                      Max.   :20.000          Max.   :10.010   
##  NA's   :7544                       NA's   :7544            NA's   :8554     
##  IncomeRange        IncomeVerifiable   StatedMonthlyIncome   LoanKey         
##  Length:113937      Length:113937      Min.   :      0     Length:113937     
##  Class :character   Class :character   1st Qu.:   3200     Class :character  
##  Mode  :character   Mode  :character   Median :   4667     Mode  :character  
##                                        Mean   :   5608                       
##                                        3rd Qu.:   6825                       
##                                        Max.   :1750003                       
##                                                                              
##  TotalProsperLoans TotalProsperPaymentsBilled OnTimeProsperPayments
##  Min.   :0.00      Min.   :  0.00             Min.   :  0.00       
##  1st Qu.:1.00      1st Qu.:  9.00             1st Qu.:  9.00       
##  Median :1.00      Median : 16.00             Median : 15.00       
##  Mean   :1.42      Mean   : 22.93             Mean   : 22.27       
##  3rd Qu.:2.00      3rd Qu.: 33.00             3rd Qu.: 32.00       
##  Max.   :8.00      Max.   :141.00             Max.   :141.00       
##  NA's   :91852     NA's   :91852              NA's   :91852        
##  ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate
##  Min.   : 0.00                       Min.   : 0.00                  
##  1st Qu.: 0.00                       1st Qu.: 0.00                  
##  Median : 0.00                       Median : 0.00                  
##  Mean   : 0.61                       Mean   : 0.05                  
##  3rd Qu.: 0.00                       3rd Qu.: 0.00                  
##  Max.   :42.00                       Max.   :21.00                  
##  NA's   :91852                       NA's   :91852                  
##  ProsperPrincipalBorrowed ProsperPrincipalOutstanding
##  Min.   :    0            Min.   :    0              
##  1st Qu.: 3500            1st Qu.:    0              
##  Median : 6000            Median : 1627              
##  Mean   : 8472            Mean   : 2930              
##  3rd Qu.:11000            3rd Qu.: 4127              
##  Max.   :72499            Max.   :23451              
##  NA's   :91852            NA's   :91852              
##  ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent
##  Min.   :-209.00             Min.   :   0.0           
##  1st Qu.: -35.00             1st Qu.:   0.0           
##  Median :  -3.00             Median :   0.0           
##  Mean   :  -3.22             Mean   : 152.8           
##  3rd Qu.:  25.00             3rd Qu.:   0.0           
##  Max.   : 286.00             Max.   :2704.0           
##  NA's   :95009                                        
##  LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination   LoanNumber    
##  Min.   : 0.00                 Min.   :  0.0              Min.   :     1  
##  1st Qu.: 9.00                 1st Qu.:  6.0              1st Qu.: 37332  
##  Median :14.00                 Median : 21.0              Median : 68599  
##  Mean   :16.27                 Mean   : 31.9              Mean   : 69444  
##  3rd Qu.:22.00                 3rd Qu.: 65.0              3rd Qu.:101901  
##  Max.   :44.00                 Max.   :100.0              Max.   :136486  
##  NA's   :96985                                                            
##  LoanOriginalAmount LoanOriginationDate LoanOriginationQuarter
##  Min.   : 1000      Length:113937       Length:113937         
##  1st Qu.: 4000      Class :character    Class :character      
##  Median : 6500      Mode  :character    Mode  :character      
##  Mean   : 8337                                                
##  3rd Qu.:12000                                                
##  Max.   :35000                                                
##                                                               
##   MemberKey         MonthlyLoanPayment LP_CustomerPayments
##  Length:113937      Min.   :   0.0     Min.   :   -2.35   
##  Class :character   1st Qu.: 131.6     1st Qu.: 1005.76   
##  Mode  :character   Median : 217.7     Median : 2583.83   
##                     Mean   : 272.5     Mean   : 4183.08   
##                     3rd Qu.: 371.6     3rd Qu.: 5548.40   
##                     Max.   :2251.5     Max.   :40702.39   
##                                                           
##  LP_CustomerPrincipalPayments LP_InterestandFees LP_ServiceFees   
##  Min.   :    0.0              Min.   :   -2.35   Min.   :-664.87  
##  1st Qu.:  500.9              1st Qu.:  274.87   1st Qu.: -73.18  
##  Median : 1587.5              Median :  700.84   Median : -34.44  
##  Mean   : 3105.5              Mean   : 1077.54   Mean   : -54.73  
##  3rd Qu.: 4000.0              3rd Qu.: 1458.54   3rd Qu.: -13.92  
##  Max.   :35000.0              Max.   :15617.03   Max.   :  32.06  
##                                                                   
##  LP_CollectionFees  LP_GrossPrincipalLoss LP_NetPrincipalLoss
##  Min.   :-9274.75   Min.   :  -94.2       Min.   : -954.5    
##  1st Qu.:    0.00   1st Qu.:    0.0       1st Qu.:    0.0    
##  Median :    0.00   Median :    0.0       Median :    0.0    
##  Mean   :  -14.24   Mean   :  700.4       Mean   :  681.4    
##  3rd Qu.:    0.00   3rd Qu.:    0.0       3rd Qu.:    0.0    
##  Max.   :    0.00   Max.   :25000.0       Max.   :25000.0    
##                                                              
##  LP_NonPrincipalRecoverypayments PercentFunded    Recommendations   
##  Min.   :    0.00                Min.   :0.7000   Min.   : 0.00000  
##  1st Qu.:    0.00                1st Qu.:1.0000   1st Qu.: 0.00000  
##  Median :    0.00                Median :1.0000   Median : 0.00000  
##  Mean   :   25.14                Mean   :0.9986   Mean   : 0.04803  
##  3rd Qu.:    0.00                3rd Qu.:1.0000   3rd Qu.: 0.00000  
##  Max.   :21117.90                Max.   :1.0125   Max.   :39.00000  
##                                                                     
##  InvestmentFromFriendsCount InvestmentFromFriendsAmount   Investors      
##  Min.   : 0.00000           Min.   :    0.00            Min.   :   1.00  
##  1st Qu.: 0.00000           1st Qu.:    0.00            1st Qu.:   2.00  
##  Median : 0.00000           Median :    0.00            Median :  44.00  
##  Mean   : 0.02346           Mean   :   16.55            Mean   :  80.48  
##  3rd Qu.: 0.00000           3rd Qu.:    0.00            3rd Qu.: 115.00  
##  Max.   :33.00000           Max.   :25000.00            Max.   :1189.00  
## 

Our dataset consists of 81 variables and almost 114,000 observations.

# Univariate Plots Section

The loan amounts do not seem to follow any common distribution; the amount of money lent looks fairly random. The majority of loans are for a “small” amount of money, but there are also peaks at around $10,000 and $15,000, and a gap between $25,000 and $30,000. I wonder how this plot will look when split by the categorical variables: employment status, whether the borrower is a homeowner, and the borrower's income range.

Based on these plots we can see that most people who take loans are employed. Homeownership does not seem to matter, because the number of borrowers is about the same in both categories. However, the distribution of income levels is interesting.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   131.6   217.7   272.5   371.6  2251.5

Since the monthly payment is $0 for some loans, I might omit these values in later analysis. The highest monthly payment is $2,251.50, which is quite high. I am also interested in whether these monthly payments are related to the APR (annual percentage rate) of the loan.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.653  15.629  20.976  21.883  28.381  51.229      25

I transformed the BorrowerAPR variable into a BorrowerAPRinPercent variable, because I find the percent form easier to read, and I hope you do too. There were also some NA values, which I replaced with the median of the variable. Most APR values lie between roughly 10% and 40%.
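The transformation described above could look roughly like the following. This is only a minimal sketch on a toy data frame: the five APR values are illustrative stand-ins, and the helper name `apr_median` is an assumption, not taken from the original analysis code.

```r
# Toy stand-in for the loans data frame (the real data has 113,937 rows).
loans <- data.frame(BorrowerAPR = c(0.165, 0.120, NA, 0.283, 0.246))

# Convert the decimal APR to percent.
loans$BorrowerAPRinPercent <- loans$BorrowerAPR * 100

# Replace missing values with the median of the observed values.
apr_median <- median(loans$BorrowerAPRinPercent, na.rm = TRUE)
loans$BorrowerAPRinPercent[is.na(loans$BorrowerAPRinPercent)] <- apr_median

summary(loans$BorrowerAPRinPercent)
```

Median imputation keeps the centre of the distribution unchanged, which is a reasonable choice when only a handful of values are missing.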

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    6.00    9.00    9.26   12.00   54.00    7604

Based on this plot we can see that most people have between 6 and 12 open credit lines, which seems like a lot to me.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3200    4667    5608    6825 1750003

The first graph did not show much information. However, after creating a summary and limiting the x axis to the 0.95 quantile, the graph becomes clearer. This graph also suggests that the income ranges are correct, because the stated monthly income has a similar distribution.
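As an illustration of the 0.95 quantile cut-off, here is a small sketch in base R. The `income` vector is a toy sample built from a few values that appear in the outputs above; `upper` and `trimmed` are hypothetical names.

```r
# Toy sample of stated monthly incomes, including one extreme value.
income <- c(0, 2083, 2875, 3083, 3200, 4667, 5608, 6125, 6825, 9583, 1750003)

# Upper limit at the 0.95 quantile.
upper <- quantile(income, 0.95, na.rm = TRUE)

# Keep only the observations at or below that limit.
trimmed <- income[income <= upper]

summary(trimmed)
```

The value of `upper` can be used either to filter the data, as here, or as the upper x-axis limit of the plot; both hide the extreme values that otherwise compress the rest of the distribution.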

To understand in which year people took the most loans, we can look at LoanOriginationYear, a new variable that I created from LoanOriginationDate.

It is also interesting to see whether there is a dependency between the month of the year and taking a loan.

The first graph of DebtToIncomeRatio does not show much, because there are some extreme values. In the second graph, with a changed x scale, the picture becomes clearer. We can also examine DebtToIncomeRatio to see if there is a trend between this value and loan taking.

Last but not least, we look at the loan status. Most of the loans in this dataset are in the Current state, which means that they are active, and the second biggest group has the status Completed.
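One possible way to derive the year (and the month used later) from LoanOriginationDate is sketched below. The three date strings come from the `str` output above; the intermediate name `origination` is an assumption.

```r
# Toy data frame with dates in the same text format as the dataset.
loans <- data.frame(
  LoanOriginationDate = c("2007-09-12 00:00:00", "2014-03-03 00:00:00",
                          "2012-11-01 00:00:00"),
  stringsAsFactors = FALSE
)

# Keep the "YYYY-MM-DD" part and parse it as a Date.
origination <- as.Date(substr(loans$LoanOriginationDate, 1, 10))

# Extract year and month as integers.
loans$LoanOriginationYear  <- as.integer(format(origination, "%Y"))
loans$LoanOriginationMonth <- as.integer(format(origination, "%m"))

# Number of loans per year.
table(loans$LoanOriginationYear)
```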

Univariate Analysis

What is the structure of your dataset?

There are 113,937 loans in the dataset with 84 features (originally there were 81, but I added some features for a better understanding of the dataset). In my opinion the most interesting of them are: LoanOriginalAmount, EmploymentStatus, IsBorrowerHomeowner, IncomeRange, MonthlyLoanPayment, BorrowerAPRinPercent, OpenCreditLines, StatedMonthlyIncome, LoanOriginationYear, LoanOriginationMonth, DebtToIncomeRatio, LoanStatus. The loan amounts do not follow any clear distribution, but BorrowerAPR shows a slight pattern resembling a normal distribution. Other observations:

  • The median loan amount is $6,500.
  • Incomes appear to be roughly normally distributed.
  • Most people state that they are employed.
  • Most people have a MonthlyLoanPayment smaller than $750.
  • The majority of loans were taken in 2013.

What is/are the main feature(s) of interest in your dataset?

The main features are LoanOriginalAmount, BorrowerAPR and the Term of the loan. Based on that information we can calculate how much money a person will pay in total and, if the person is not paying on time, how much extra they will pay for the loan because of the delay.
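As a rough illustration of such a calculation, the sketch below applies the standard annuity formula for a fixed monthly payment. It is an assumption that Prosper computes payments this way, and the hypothetical helper `monthly_payment` uses the nominal BorrowerRate rather than BorrowerAPR, because the APR also includes fees. The inputs are taken from the first loan in the `str` output above (amount 9425, rate 0.158, term 36).

```r
# Fixed monthly payment for an amortizing loan (standard annuity formula).
# principal: amount borrowed; annual_rate: nominal yearly rate as a decimal;
# term_months: number of monthly payments. Works on single values.
monthly_payment <- function(principal, annual_rate, term_months) {
  r <- annual_rate / 12                     # monthly interest rate
  if (r == 0) return(principal / term_months)
  principal * r / (1 - (1 + r)^(-term_months))
}

payment    <- monthly_payment(9425, 0.158, 36)
total_paid <- payment * 36                  # total of all payments
interest   <- total_paid - 9425             # cost of borrowing

round(c(payment = payment, total = total_paid, interest = interest), 2)
```

For this loan the formula gives a payment of roughly 330, which is close to the MonthlyLoanPayment shown for the same row in the `str` output.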

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

To support the investigation, features such as EmploymentStatus, IncomeRange and MonthlyLoanPayment might be useful. We could also look at loans that were not paid back and look for a dependency that predicts whether the loan requester will be able to pay or not.

Did you create any new variables from existing variables in the dataset?

Yes, I created three new variables. I transformed BorrowerAPR to BorrowerAPRinPercent to make the numbers in the graphs easier to understand (for me, and maybe for others, these numbers are more readable than decimals). I also split LoanOriginationDate into months and years to look for trends in these data.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I did not see any unusual distributions. I hoped to spot more trends in this data, but it looks like the loans differ a lot from person to person. In the first part of this document I had to rescale axes often to better understand the data in the graphs, and I had to limit some observations to exclude extreme values so that they did not interfere with the rest of the data. I am not saying that these extreme values are wrong; they might be correct observations, but I wanted to look at the majority of the dataset. These extreme values might be valuable in the following parts of the research.

Bivariate Plots Section

##                      OpenCreditLines DebtToIncomeRatio StatedMonthlyIncome
## OpenCreditLines                    1                NA                  NA
## DebtToIncomeRatio                 NA                 1                  NA
## StatedMonthlyIncome               NA                NA                1.00
## LoanOriginalAmount                NA                NA                0.35
## MonthlyLoanPayment                NA                NA                0.34
## BorrowerAPRinPercent              NA                NA                  NA
## LoanOriginationYear               NA                NA                0.14
## LoanOriginationMonth              NA                NA                0.00
##                      LoanOriginalAmount MonthlyLoanPayment BorrowerAPRinPercent
## OpenCreditLines                      NA                 NA                   NA
## DebtToIncomeRatio                    NA                 NA                   NA
## StatedMonthlyIncome                0.35               0.34                   NA
## LoanOriginalAmount                 1.00               0.94                   NA
## MonthlyLoanPayment                 0.94               1.00                   NA
## BorrowerAPRinPercent                 NA                 NA                    1
## LoanOriginationYear                0.31               0.26                   NA
## LoanOriginationMonth              -0.02              -0.01                   NA
##                      LoanOriginationYear LoanOriginationMonth
## OpenCreditLines                       NA                   NA
## DebtToIncomeRatio                     NA                   NA
## StatedMonthlyIncome                 0.14                 0.00
## LoanOriginalAmount                  0.31                -0.02
## MonthlyLoanPayment                  0.26                -0.01
## BorrowerAPRinPercent                  NA                   NA
## LoanOriginationYear                 1.00                -0.10
## LoanOriginationMonth               -0.10                 1.00

We can see some basic correlations from the summary above, but let's examine the data in more detail.

We can see that MonthlyLoanPayment has a strong correlation with LoanOriginalAmount, but it is also slightly correlated with the loan origination year, which is interesting. We can also see that StatedMonthlyIncome is moderately correlated with LoanOriginalAmount.

We can see extremely high lines at some loan amounts, at $15,000 and $25,000. It might be interesting to look at this in detail later in the research. However, we cannot see any pattern in this data; the only thing I spotted is that people need a stated monthly income of at least $7,500 to be able to get a loan greater than $25,000.

These graphs show an interesting trend in the loans and total payments. In the first graph the relationship looks linear, and in the second graph it becomes even more clearly linear, which gives us interesting insights into the data. Moving to the third graph, we again see a linear relationship. I think it is quite interesting to see the relationship change just by limiting the axis.

I also wanted to examine whether there is a dependency between AvailableBankcardCredit and the utilization of that bankcard. However, we cannot derive any conclusion from the graph.
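The NA entries in the correlation matrix appear because `cor()` by default returns NA for any pair in which a column contains missing values. A minimal sketch of how those cells could be filled using only complete pairs is shown below; the toy data frame takes the first five values of each column from the `str` output above, and the names `default_cor` and `pairwise_cor` are assumptions.

```r
# Toy data frame; OpenCreditLines contains one missing value.
loans <- data.frame(
  LoanOriginalAmount = c(9425, 10000, 3001, 10000, 15000),
  MonthlyLoanPayment = c(330, 319, 123, 321, 564),
  OpenCreditLines    = c(4, 14, NA, 5, 19)
)

# Default behaviour: pairs involving a column with NA values give NA.
default_cor <- cor(loans)

# For each pair of columns, use only the rows where both values are present.
pairwise_cor <- cor(loans, use = "pairwise.complete.obs")

round(pairwise_cor, 2)
```

With `use = "pairwise.complete.obs"` each correlation may be based on a different subset of rows, so the coefficients are not strictly comparable when many values are missing.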

Now let’s move on to some categorical variables.

## loans$IncomeRange: $0
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    2500    5000    7411   10000   25000 
## ------------------------------------------------------------ 
## loans$IncomeRange: $1-24,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    2052    4000    4274    5000   25000 
## ------------------------------------------------------------ 
## loans$IncomeRange: $25,000-49,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    3000    5000    6178    9800   25000 
## ------------------------------------------------------------ 
## loans$IncomeRange: $50,000-74,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    7500    8675   13500   25000 
## ------------------------------------------------------------ 
## loans$IncomeRange: $75,000-99,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    9700   10366   15000   25000 
## ------------------------------------------------------------ 
## loans$IncomeRange: $100,000+
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    6000   12000   13073   18500   35000 
## ------------------------------------------------------------ 
## loans$IncomeRange: Not displayed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    2100    3033    5170    6001   25000 
## ------------------------------------------------------------ 
## loans$IncomeRange: Not employed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    2500    4000    4885    6000   25000

It looks like people in higher income ranges tend to borrow more money. The one exception to this trend is the group with a $0 income range.
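The per-group summaries above have the shape that base R's `by()` prints. A minimal sketch of that kind of call follows, on a made-up data frame (`toy` and its values are illustrative; the exact call used for the report is not shown, so this is an assumption):

```r
# Toy stand-in for the loans data (values are invented).
toy <- data.frame(
  IncomeRange = c("$1-24,999", "$1-24,999", "$100,000+", "$100,000+", "$100,000+"),
  LoanOriginalAmount = c(1000, 4000, 6000, 12000, 35000)
)

# Five-number summary plus mean of the loan amount within each income range.
by_income <- by(toy$LoanOriginalAmount, toy$IncomeRange, summary)
print(by_income)

# Medians only, which are easier to compare across groups at a glance.
medians <- tapply(toy$LoanOriginalAmount, toy$IncomeRange, median)
print(medians)
```

Comparing medians rather than means keeps a few very large loans from dominating the comparison between groups.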

Again, we cannot draw any conclusion about a trend from this plot; I think these data only show that people in every group tend to take out loans. The only thing we can see is that employed people take out more loans.

## loans$EmploymentStatus: 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    2000    3000    4563    5000   25000 
## ------------------------------------------------------------ 
## loans$EmploymentStatus: Employed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    9000    9794   15000   35000 
## ------------------------------------------------------------ 
## loans$EmploymentStatus: Full-time
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    2500    4950    6195    8000   35000 
## ------------------------------------------------------------ 
## loans$EmploymentStatus: Not available
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    2138    3225    5373    6300   25000 
## ------------------------------------------------------------ 
## loans$EmploymentStatus: Not employed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    2500    4000    4873    6000   25000 
## ------------------------------------------------------------ 
## loans$EmploymentStatus: Other
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    4000    6862   10000   35000 
## ------------------------------------------------------------ 
## loans$EmploymentStatus: Part-time
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    1600    3000    4089    5000   25000 
## ------------------------------------------------------------ 
## loans$EmploymentStatus: Retired
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    2000    3500    4784    6000   25000 
## ------------------------------------------------------------ 
## loans$EmploymentStatus: Self-employed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    7000    8123   11000   25000

It is also quite interesting to see from the statistics that people who are not employed have the same median LoanOriginalAmount (4,000) as people with the "Other" employment status. However, if we look at the third quartile (6,000 versus 10,000), we see a big difference between these two groups.

## loans$LoanStatus: Cancelled
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    1000    1000    1700    2500    3000 
## ------------------------------------------------------------ 
## loans$LoanStatus: Chargedoff
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    3000    4500    6399    8000   25000 
## ------------------------------------------------------------ 
## loans$LoanStatus: Completed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    2550    4500    6189    8000   35000 
## ------------------------------------------------------------ 
## loans$LoanStatus: Current
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2000    4000   10000   10361   15000   35000 
## ------------------------------------------------------------ 
## loans$LoanStatus: Defaulted
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    2550    4275    6487    8000   25000 
## ------------------------------------------------------------ 
## loans$LoanStatus: FinalPaymentInProgress
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2000    4000    6500    8346   10000   31000 
## ------------------------------------------------------------ 
## loans$LoanStatus: Past Due (1-15 days)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2000    4000    7000    8468   12000   35000 
## ------------------------------------------------------------ 
## loans$LoanStatus: Past Due (16-30 days)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2000    4000    6000    8156   11129   25000 
## ------------------------------------------------------------ 
## loans$LoanStatus: Past Due (31-60 days)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2000    4000    6500    8534   10000   35000 
## ------------------------------------------------------------ 
## loans$LoanStatus: Past Due (61-90 days)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2000    4000    6000    7730   10000   25000 
## ------------------------------------------------------------ 
## loans$LoanStatus: Past Due (91-120 days)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1500    4000    6000    8004   11000   25000 
## ------------------------------------------------------------ 
## loans$LoanStatus: Past Due (>120 days)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2500    4000    7500    8281   11250   15500

From the graph and data above we can see that most loans are still in the Current state and that their amounts have the widest spread on the x-scale. We can also see that the cancelled loans were small ones, which I think is a good thing for the organization.

If we look at this graph, we can see something quite interesting: people take out smaller loans in the middle of the year, and loan amounts grow towards the end of the year.

Based on the yearly graph, I suppose that before 2008 there was a trend of people taking loans because they had to or because they wanted to buy property. After 2008, however, loan amounts fell back to the level of 2006. It took this company's customers three years to recover: in 2011 LoanOriginalAmount finally returned to the same (slightly higher) level as in 2008. From that year on, loan amounts only rose, which I suppose might be due to inflation or to other factors on the financial market.
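One way to put numbers behind the yearly trend is to compute the median loan amount per origination year. The sketch below does this with base R's `aggregate()` on a made-up data frame (`toy` and its values are illustrative and do not reproduce the Prosper figures):

```r
# Toy stand-in: a few loans per origination year (values are invented).
toy <- data.frame(
  LoanOriginationYear = c(2007, 2007, 2009, 2009, 2013, 2013),
  LoanOriginalAmount  = c(4000, 6000, 2000, 3000, 10000, 15000)
)

# Median loan amount for each origination year, sorted by year.
median_by_year <- aggregate(LoanOriginalAmount ~ LoanOriginationYear,
                            data = toy, FUN = median)
print(median_by_year)
```

Replacing LoanOriginationYear with LoanOriginationMonth in the formula would give the same kind of table for the seasonal pattern.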

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

I found a relationship between LoanOriginalAmount and LP_CustomerPayments. I also found some relationship between IncomeRange and LoanOriginalAmount. The most surprising thing for me was a dependency I did not expect to see: the one between the origination month and LoanOriginalAmount. The other graphs had made me think that every loan is unique and that there are few dependencies between the provided variables, so I was quite surprised to see that loan amounts in the middle of the year are lower than in the months at the start and at the end of the year.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

I found an interesting relationship between the years and months in which loans were taken. This was quite surprising to see because, judging by the other variables, there are not many dependencies between them.

What was the strongest relationship you found?

The strongest relationship I found was between LP_CustomerPayments and LoanOriginalAmount, which is quite logical: the larger the loan, the more the borrower has to pay in order to complete it. However, this conclusion is not as simple as it may seem. I found it interesting because loans can have a wide variety of lengths, yet despite this length parameter we see an almost linear relationship between the total amount and customer payments.
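One way to see why the relationship can stay close to linear is the standard fixed-rate annuity formula: for a given rate and term, the monthly payment is a constant multiple of the principal, and the term only changes the slope. The sketch below is an illustration under that assumption; whether Prosper computes payments exactly this way is not stated in the data, and `monthly_payment`, the 12% rate, and the amounts are hypothetical.

```r
# Standard annuity payment for a fixed-rate loan (an assumed pricing model).
monthly_payment <- function(principal, annual_rate, term_months) {
  r <- annual_rate / 12  # monthly interest rate
  principal * r / (1 - (1 + r)^(-term_months))
}

amounts <- c(5000, 10000, 20000)

# Same rate, two different terms (36 and 60 months both occur in the data).
pay_36 <- monthly_payment(amounts, annual_rate = 0.12, term_months = 36)
pay_60 <- monthly_payment(amounts, annual_rate = 0.12, term_months = 60)

# Total paid over the life of the loan, comparable to LP_CustomerPayments
# for a loan that runs to completion.
total_36 <- pay_36 * 36
total_60 <- pay_60 * 60

print(round(cbind(amounts, pay_36, pay_60, total_36, total_60), 2))
```

Within each term the ratio of payment to principal is constant, so a scatterplot that mixes terms and rates shows a fan of straight lines rather than a curve.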

Multivariate Plots Section

The plots above are pretty hard to read because of the huge amount of data and the extreme values in them. They support the earlier finding that only a limited number of variables are related to each other, partly because of extreme loan cases. Let’s look at the same graphs with dots rather than lines.

Here the distribution is easier to see, partly because we have chosen two strongly dependent variables, LoanOriginalAmount and MonthlyLoanPayment.

In that particular graph we can see an interesting thing: people who are employed and earn $50,000+ tend to take larger loans and have higher monthly payments. We can spot this by looking at the upper edge of the points in each panel, where MonthlyLoanPayment gets steeper towards the higher income ranges.

In the facet wrap by LoanStatus we can spot that completed loans had higher MonthlyLoanPayment values than the loans in the other panels.

If we look at other variables and try to find more dependencies, we cannot spot many of them. It seems that the theory of each loan being very individual holds.

Coming back to the dependency on the time when the loan was taken, we can see an interesting thing in the graph where colors are determined by the year of the loan. We can clearly see that the majority of loans were taken closer to the present and that their amounts are higher than the amounts in the past.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

This part strengthened my observation that the monthly payment increases with loan size. I also tried to connect BorrowerAPR with some other variable, but I did not see any dependency there.

Were there any interesting or surprising interactions between features?

I think it is interesting to see that the majority of high-amount loans are still in the Current state.


Final Plots and Summary

Plot One

Description One

We cannot see any particular distribution of loan amounts, but we can see interesting peaks at round amounts. It appears that numerous loans have been taken out for round amounts such as $5,000, $10,000, $15,000, $20,000, and $25,000.
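The clustering at round amounts can also be checked numerically by counting how many loans fall on exact multiples of 5,000. The sketch below uses a short made-up vector (`toy_amounts` is illustrative and not taken from the Prosper data):

```r
# Toy loan amounts (values are invented).
toy_amounts <- c(1000, 4000, 5000, 5000, 7500, 10000, 10000, 15000, 25000, 3300)

# TRUE where the amount is an exact multiple of 5,000.
is_round <- toy_amounts %% 5000 == 0

# Share of loans that sit on a round amount.
share_round <- mean(is_round)
print(share_round)

# Most frequent individual amounts, which is where histogram peaks come from.
top_amounts <- sort(table(toy_amounts), decreasing = TRUE)
print(head(top_amounts, 3))
```

On the real data, the same two lines applied to `loans$LoanOriginalAmount` would show how much of the histogram's shape is explained by these round values.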

Plot Two

Description Two

Cancelled loans have the lowest median amount of any loan status, while loans in the Current state have the highest. Defaulted and Completed loans have a similar distribution of amounts, with the difference that completed loans have a higher maximum (35,000 versus 25,000). In the graph we can also spot an interesting pattern in the past due loans.

Plot Three

Description Three

The plot indicates that there is a relationship between the monthly payment of a loan and the original loan amount. As the color of the dots suggests, we can also see that people in higher income ranges could afford to take out larger loans.


Reflection

The loans data set contains information on almost 114,000 loans, across 81 variables, from around 2005. I started by understanding the basic variables in the data set and chose the ones most relevant for my EDA. Along the way I kept asking questions that led me to explore and dig deeper into the data. After examining the basic variables one by one, I combined them to see the more complex story that these data tell.

In this data I found a strong relationship between loan amount and monthly payment. This may seem obvious, but the term of a loan can differ, and that changes the monthly payment. Despite the presence of this variable, the dependency is almost linear. Another interesting thing I found is the relationship between loan amount and the month in which the loan was taken. I also saw how the financial crisis in 2008 affected the loan market, and I analyzed how the variables in this data set are related.

I have come to the conclusion that it is not easy to build a model that predicts loan approval or loan amount from the information provided about a person. This made me even more fascinated by the algorithms that lending companies use to assess their applicants. To me it seems that every loan is very individual and has to be handled with regard to all of its circumstances. If I were to continue exploring this data set, I would analyze variables such as credit card information, the various indexes, and so on. I also think that this data set contains a lot of technical data, but personally I would appreciate more data on the individual borrower, such as age, education, and similar attributes. This might help to create a prediction model that could be useful to some extent.